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Abstract 

Simple hidden Markov models are proposed for predicting secondary structure of a protein from its 
amino acid sequence. Since the length of protein conformation segments varies in a narrow range, we 
ignore the duration effect of length distribution, and focus on inclusion of short range correlations of 
residues and of conformation states in the models. Conformation-independent and -dependent amino 
acid coarse-graining schemes are designed for the models by means of proper mutual information. We 
compare models of different level of complexity, and establish a practical model with a high prediction 
accuracy. 

PACS number(s): 87.10.-t-e,02.50.-r 



1 Introduction 

Methods for predicting the secondary structure of a protein from its amino acid sequence have been devel- 
oped for 3 decades. Besides neural network models and nearest-neighbor methods, the Chou-Fasman/GOR 
statistical method is vifell-established and commonly used. In 1974, assuming an oversimplified independency 
to cope with the large size 20 of the amino acid alphabets at a small size of database, Chou and Fasman 
(1974) derived a table of propensity for a particular residue to be in a given secondary structure state. By 
combining with a set of rules, the protein secondary structure was predicted using this propensity. Later, in 
the first version of the GOR program (Garnier, Osguthorpe, and Robson, 1978), the state of a single residue 
ai was predicted according to a window from i — 8 to i -I- 8 surrounding the residue. Unlike Chou-Fasman 
which assumes that each amino acid individually influences its own secondary structure state, GOR takes 
into account the influence of the amino acids flanking the central residue on the central residue state by 
deriving an information score from the weight matrix describing 17 individual amino acid frequencies at 
sites i -\- k with —8 < k < +8. By using a single weight matrix, the correlation among amino acids within 
the window was still ignored. In the later version GOR III (Gibrat, Garnier, and Robson, 1987), instead of 
single weight matrix for every structure state, 20 weight matrices, each of which corresponds to a specific 
type of the central residue, were used. These conditional weight matrices take the pair correlation between 
the central residue and a flanking one into account. In the most recent version of GOR (GOR IV, Garnier, 
Gibrat, and Robson, 1996), all pairwise combinations of amino acids in the flanking region were included. 

Hidden Markov models (HMMs) (Rabiner, 1989) have been applied to molecular biology, in particular 
in gene-finding. There is a constant tendency in developing HMMs for protein structure prediction (Asai et 
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al.,1993; Stultz et al., 1993; Karplus et al., 1997; Camproux et al. 1999). A probabilistic approach similar to 
the gene finder Genscan has been developped for protein secondary structure prediction without using sliding 
windows (Schmidler, Liu and Brutlag, 2000). In terms of Bayesian segmentation, the mothod integrated 
explicit models for secondary structure classes helices, sheets and loops with other observed structure aspects 
such as segment capping signals and length distributions, and reached an accuracy comparable to GOR. 

Compared with DNA sequences, protein sequences arc generally short, and their amino acid alphabet 
is of a large size 20. The range of lengths of secondary structure segments is rather small. The effect of 
duration might play a less important role. Here we develop a simple hidden Markov model with higher order 
correlations included for the secondary structure prediction. We propose several schemes for the amino acid 
alphabet reduction in order to incorporate residue correlation in the model. While the model is much simpler 
than the Bayesian segmentation model, its performance is still competitive. 



A simplified version of the model can be constructed in the frame of the Chou-Fasman propensity scheme. 
We shall start with this model to explain several key points, and then discuss more realistic models. 

In the Ghou-Fasman approach, discriminant thresholds and post-prediction filtering rules are required. 
They can be avoided in a full probabilistic model. 

As in the most methods, we consider 3 states {h, e, c} generated from the 8 states of Kabsch and Sander 
(1983) by the coarse-graining H,G,I ^ h, E ^ e and X, T,S,B — > c. Let R = Ri;n = R1R2 ... -Rn be a 
sequence of n amino acid residues, and its corresponding secondary structure sequence he S = S1S2 ■ ■ ■ Sn- 
The structure prediction is the mapping from R to S. The main restriction to a structure sequence is that 
the shortest length of the consecutive state h must be 3, and that of e be 2. To cope with this restriction, 
we use triplet states instead of the 3 single states c, e and h. In the total 27 triplets, only 19 are legitimate. 
The forbidden 8 are of the type eee or hhh, where e indicates a non-e, i.e. either c or h, and the meaning 
of h is analogous. The first order Markov model for the triplets is the third order Markov model for the 
original mono-states. Any of the 19 triplets can only transit cither to 3 or to 1 state. That is, the transition 
matrix is rather sparse. The 19 triplets and their transited states arc listed in Table 1. 

Denoting by tr^ the triplet states, we may translate Si:n to S2:n-i = o'2<J3 . . . a„-i of length n — 2. (Note 
that the subscripts i for Ui are from 2 to n — 1 for convenience.) The Markov process for <j is characterized 
by the set of probabilities for initial states 
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-k{1) = Prob(cr2 = I), 



(1) 



and the transition rates 



Tki = Prob(o-,; = l\ai-i 



k) = T((Tj_i,cri). 



(2) 



Sequence R is then related to S or 5 by the emission probabilities 



P{x\5) = Prob(Ei = x\ai = 5), 



(3) 



which generate S^-.n-i- Extra probabilities 



Q{x\5) = Prob(i?j = x\aj±i = 6),j e {l,n} 



(4) 



are needed for generating Ri and R 
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In this model the probabihty for the state sequence 5 or S is 



n-2 

P{E)=n{a2)l[T{ai,ai+,), (5) 

and the likehhood for R to be at S is 

n-l 

P(i?|E) = Q{Si\a2)Q{Sn\an-i) n P{Si\ai). (6) 

The joint probability is then 

P{R, S) = P{R, S) = P{R\i:)P{T,). (7) 
The predicted structure is infered as 

E* = argmax2P(S|i?) oc argmaxsP(i?, E). (8) 

By means of the recursion relation 

r2(a) = Q{Si\a)n{a)P{R2\a), (9) 
Tiia) = maj^sTi-i{5)T{S,a)P{Ri\(T), 2 < i < n, (10) 
r = r„=max5r„_i(5)Q(5„|5), (11) 

and by recording the pre-state leading to the maximal Ti{a), the 'best' path S* can be traced back from the 
last This is the so-called Viterbi algorithm for dynamic programming. 

According to the standard forward-backward algorithm for HMMs, the forward and backward variables 
Ai and Bi may be defined as follows. 

A2{a) = Q{Si\a)n{a)P{R2\<j), (12) 

M'^) = ^Ai_i{5)T{S,a)P{Ri\a), 2 < i < n. (13) 
s 

Similarly, 

Bn-i{a) - Q{Sn\a), (14) 
-S'W = ^Si+i(<5)r(cT,5)P(i?»+i|5), 2<i<n-l. (15) 

(16) 

It can be seen that for non-ending i, Ai{a) = Prob(i?i;i, crj = a), Bi{a) = Prob(i?i+i:„|cri = a), and the 
partition function 

Z = Y.P{R,i:)=Y,M<^)Bi{a) (17) 
= Y.M'^Wa), 2<i<n-l. (18) 

Denoting a^^^ the center Si of the triplet Uj = <S'i_iS'j<S'i+i, and introducing the characteristic function for 
z G {c, e, /i} 

0, otherwise, 
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we may infer single residue state from the marginal posterior 

Pro^^i = z\R) o^J2M'^)B^ic7)6ia,z). (20) 

a 

This is the Baum-Wclch algorithm for single residue states. 

So far, only the correlation of conformation states has been considered. Residue triple will involve 
20^ = 8000 parameters for each state a. To avoid large training sets and model overfitting, a reduced 
amino acid alphabet is desired. For example, the reduction of 20 amino acids to 3 classes leads to only 27 
combinations. However, there are as many as 

l^C^(-iy(3-j)2°«5.8xlO« 

ways of clustering 20 amino acids into 3 classes (Duran and OdcU, 1974). For a given clustering {ai}J£g — > 
{bj}, bj G {0, 1, 2}, denoting by pi = Vi-iriTi+i the reduced residue triple corresponding to state ai, we may 
calculate the mutual information between the reduced residue triple p and the triple state a: 

I{p,cj) = H[p\+H[a]-H[p,a], (21) 

where H[x\ is the Shannon entropy of x. (If the clustering independent H[o] is ignored, / would become the 

conditional entropy _ff[(T|/3].) The best clustering may be determined by maximizing the objective function 
I{p,a). The replacement of the above P{Ri\ai) with P{pi\ai) leads to a version which takes some residue 
correlation into account. 

A more realistic model uses quintuplets. Among the total 3^ = 243 quintuplets of the conformation 

states, only 75 are legitimate. Exclusion of 7 rare ones {eceeh, hceeh, heece, heech, heeeh, heehh, and 
hheeh) further reduces the total number of states into 68, which are listed in Table 2. We shall still use the 
same notation aj for these 68 conformation states. To take residue quintuplet correlation into account, we 
substitute the central residue score P{Ri\ai) with P{Ri\ai,ri-2ri-iri+iri+2), where r, stands for reduced 
residue classes. More words need to be said about the amino acid clustering. We have observed the fact 
that counts of ccccc, eeeee and hhhhh in databases are dominant over those of remaining 65. The various 
propensities of amino acid residues to different conformations imply that amino acid clustering should depend 
on conformations. We want to cluster amino acids separately for each conformation. For this purpose, for 
example, to find the best clustering at conformation c, we collect a subset of residue quintuplets whose 
conformation is ccccc. Denote by Rq the central residue of a residue quintuplet, and by r_2r_irir2 the 
reduced classes of the other 4 residues. Taking the mutual information /(i?o, ?'_2r_irir2) as the objective 
function, we determine the best clustering at c. To find at which position the residue depends most strongly 
on others, we calculate mutual information for nonreduced residue placed at different positions of a residue 
quintuplet. While the largest I is found when the nonreduced residue is at the center for conformation e, 
the position of the nonreduced residue which gives the largest / is the second position for conformation c, 
and is the fifth for h. However, for either c or h, the largest / is still very close to I{Ro,r-2r-irir2)- The 
mutual information excesses with respect to I{Ra,r^2f-i'rir2) for different positions at conformations c, e 
and h are listed in Table 3. Thus, for simplicity, we always place the nonreduced residue at the center for 
all conformations. We then calculate refined residue scores from the conformation-dependent clustering. For 
example, 

P{Ri\(Ti = ccchh, ri_2ri-iri+iri+2) P{Ri\(Ti, r?_2r?_irf+irf+2), (22) 
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where the superscript of indicates its conformation. The whole procedure for the dynamic programming 
remains almost the same, except for some care when dealing with two more end sites. 

3 Result 

We create a nonredundant set of 1612 non-membrane proteins for training parameters from PDB_SELECT 

(Hobolim and Sander, 1994) with amino acid identity less than 25% issued on 25 September of 2001. The 
secondary structure for these sequences are taken from DSSP database (Kabsch and Sander, 1983). As 
mentioned above, the eight states of DSSP are coarse-grained into 3 states: h, e and c. This learning set 
contains 268031 residues with known conformations, among which 94415 are h, 56510 are e, and 117106 are 
c. The size of the learning set is reasonable for training our parameters. There are 296 unknown residues. 
We add an extra 'unknown' amino acid category called X to the 20 known ones. 

In order to assess the accuracy of our approach, we use the following 2 test sets: Sets 1 and 2. A set of 124 
nonhomologous proteins is created from the representative database of Rost and Sander (1993) by removing 
subunits A and B of hemagglutinin 3hmg, which are designated as membrane protein by SCOP (Murzin 
et al, 1995). The 124 sequences and the learning set are not independent of each other according to HSSP 
database (Dodge, Schneider and Sander, 1998). That is, some proteins of the 124 sequences are homologous 
with certain proteins in the learning set. Removing the homologous proteins from the 124 sequences and 5 
seuqences with unknown amino acid segments longer than 6, we construct Set 1 of 76 proteins. Nonredundant 
34 proteins with known structures of the CASP4 database issued in December of 2000 are taken as Set 2 
(CASP4, 2000). 

3.1 Amino acid clustering 

The first method for clustering uses the mutual information I{p, a) between the reduced oligo-peptide p 
and its corresponding conformation a. Setting the number of reduced classes at 3, 4 and 5, and fixing cr to 
bo triplets, we find the conformation-independent clustering of amino acids as shown in Table 4. Roughly 
speaking, class is the hydrophobic, and is the same for the all 3 clusterings (except for the 'unknown' X in 
the clustering into 5). Increasing the number of classes from 3 to 4, 5 results in new classes forming by single 
special amino acids P(Pro) and G(Gly). The similar clustering may be conducted in terms of quintuplets, 
but the results do not coincide with those from triplets. 

The second method for clustering is conformation-dependent. We collect residue quintuplets of conforma- 
tions ccccc, eeeee and hhhhh separately. The objective function for clustering now is the mutual information 
I{Ro,r-2r-irir2\cr) with conformation a fixed. The results of clustering into 3 and 4 classes are listed in 
Table 5. Indeed, the cluster patterns for different conformations are quite dissimilar. 

3.2 Secondary structure prediction 

We shall index different models by Ng-Na, where Ng and Na are the numbers of conformation states and 
residue combinations, respectively. For example, the simplest model is model 19-21, which uses P{Ri\Si-iSiSi+i) 
to score residues. We may also use the propensity scores P{Ri\Si-iSiSi+i)/P{Ri), which result in an extra 
factor nr=o Since the factor is independent of conformation sequences, it brings no new effect. When 

the scores arc replaced by P{ri-iriri+x\Si-iSiSij^{) with n being the reduced 3-class residues, we have 
model 19-3 x 3 x 3. The first and last sites of the predicted conformation of any sequences are always set at 
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conformation c. For model 19-21 we determine the best conformation sequence as whole by the Viterbi algo- 
rithm and single residue conformation states by the Baum- Welch algorithm. To assess prediction methods, 
we calculate for each conformation the sensitivity s„ and specificity Sp 

_ TP TP 
^" - TP + FN' ^ TP + FP' ^ 

where TP, FP and FN are site counts of the 'true positive', 'false positive' and 'true negative' with respect 
to the observed real conformation. The results of model 19-21 are listed in Table 6, where the total sensitivity 
Qs for all conformations is also given. It is clearly seen that the inference from the Baum- Welch marginal 
posterior is significantly superior to that from the Viterbi algorithm in Qs value. 

The next examined model uses reduced residue triplets. Reducing amino acids into 3 and 4 classes, we 
have models 19-27 and 19-64, respectively. The prediction accuracies of these two models are also listed in 
Tabel 6. We see that the inclusion of residue correlation dramatically improves the prediction accuracy. 

The remaining part of Table 6 shows the prediction accuracies of quintuplet models. We examine the 
models on both test Sets 1 and 2. For all the models we take P(_Ri|ri_2''i-i''i+i'''i+2, cti) as the residue 
scores. Wo first compare conformation-independent with conformation-dependent clustering of amino acids. 
We find that the conformation-dependent clustering gains about 1 percent in the prediction accuracy for 
model 68-81x21. The conformation-independent clustering is then not considered later on. Model 68-256x21 
contains more information about correlated residues, and has a better performance. The accuracies obtained 
on Set 1 arc generally higher than those on Set 2. Besides the popular predictor GOR IV, there is another 
secondary structure predictor SSP (Solovycv and Salamov, 1991, 1994) based on discriminant analysis using 
single sequence. To compare with them, their accuracies on the same test sets are also listed in the table. 



4 Discussions 

We have presented simple hidden Markov models to predict secondary structure using single protein sequence. 
The hidden sequence is generated by a Markov process of multi-site conformation states. Considering 
that structure segments of proteins are generally short, we have ignored the duration effect, and focused 
on short range correlations. We proposed several schemes for coarse-graining the amino acid alphabet in 
order to include multi-residue correlation. Such reduction has been used in the Bayesian segmentation of 
protein secondary structure (Schmidler, et. al., 2000). However, here wc derived the coarse-graining schemes 
specially for scoring residues to fit conformations. We have discussed only the principle of taking proper 
mutual information as an objective function for clustering, but did not exhaust all possibilities for clustering. 
For example, to diminish parameters, one may consider residue triplets at conformation quintuplet states. 
Another possibility is to use quartuples for both residues and conformations. One can also cluster quintuplet 
conformation states to less states. 

There are rooms for further improvement of our approach. Simple weights (3 values) may be introduced 
to adjust residue scores according to its single-site conformation. Moreover, we may divide a training set into 
several, say 2, subsets according to residue statistics. Again, for this purpose the coarse-graining schemes 
help. The two subsets are then used separately for training to get refined models. We first classify a query 
sequence into one of the two categories, and then apply to it the corresponding refined model. Wc have tested 
this on the simple triplet models. The primitive result of up to 2% in accuracy improvement is encouraging. 
This, and the protein family recognition using multiple reduced amino acid residues, are under study. 
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Table 1. Triplet states and their transited states. 



State 


Transited 


State 


Transited 





CCC 


0, 1, 2 


10 


EEH 


11 


1 


CCE 


3 


11 


EHH 


18 


2 


CCH 


4 


12 


HCC 


0, 1, 2 


3 


GEE 


8, 9,10 


13 


HCE 


3 


4 


CHH 


18 


14 


HCH 


4 


5 


ECC 


0, 1,2 


15 


HEE 


8, 9,10 


6 


ECE 


3 


16 


HHC 


12,13,14 


7 


ECH 


4 


17 


HHE 


15 


8 


EEC 


5, 6, 7 


18 


HHH 


16,17,18 


9 


EEE 


8, 9,10 









Table 2. 68 quintuplet states. 






CCCCC 


17 


CHHHE 


34 


EEECH 


51 


HEECC 


1 


CCCCE 


18 


CHHHH 


35 


EEEEC 


52 


HEEEC 


2 


CCCCH 


19 


ECCCC 


36 




53 


HEEEE 


3 


CCCEE 


20 


ECCCE 


37 


EEEEH 


54 


HHCCC 


4 


CCCHH 


21 


ECCCH 


38 


EEEHH 


55 


HHCCE 


5 


CCEEC 


22 


ECCEE 


39 


EEHHH 


56 


HHCCH 


6 


CCEEE 


23 


ECCHH 


40 


EHHHC 


57 


HHCEE 


7 


CCEEH 


24 


ECEEC 


41 


EHHHE 


58 


HHCHH 


8 


CCHHH 


25 


ECEEE 


42 


EHHHH 


59 


HHEEC 


9 


CEECC 


26 


ECHHH 


43 


HCCCC 


60 


HHEEE 


10 


CEECE 


27 


EECCC 


44 


HCCCE 


61 


HHHCC 


11 


CEECH 


28 


EECCE 


45 


HCCCH 


62 


HHHCE 


12 


CEEEC 


29 


EECCH 


46 


HCCEE 


63 


HHHCH 


13 


CEEEE 


30 


EECEE 


47 


HCCHH 


64 


HHHEE 


14 


CEEEH 


31 


EECHH 


48 


HCEEC 


65 


HHHHC 


15 


CEEHH 


32 


EEECC 


49 


HCEEE 


66 


HHHHE 


16 


CHHHC 


33 


EEECE 


50 


HCHHH 


67 


HHHHH 



Table 3. Mutual information excesses (xlO~^) for different positions of the nonreduced residue with 
respect to the mutual information for the nonreduced residue at the quintuplet center. The center is referred 
to as position 3. 





Position of the nonreduced 


Conformation 


1 


2 


4 


5 


c 


-19.5 


2.7 


-3.0 


-7.4 


e 


-46.1 


-10.2 


-15.5 


-82.3 


h 


-49.6 


-33.8 


-0.5 


1.8 
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Table 4. Conformation-independent clustering of amino acids into 3, 4 and 5 classes. 



Amino Acid 


A 


V 


C 


D 


E 


F 


G 


H 


I 


W 


K 


L 


M 


N 


Y 


P 


Q 


R 


S 


T 


X 


3-class 


1 








2 


1 





2 


1 








1 








2 





2 


1 


1 


1 


1 


1 


4- class 


1 








2 


1 





2 


2 








1 








2 





3 


1 


1 


2 


2 


1 


5-class 


1 








2 


1 





4 


2 








1 








2 





3 


1 


1 


2 


2 





Tabic' 5. Conformati()n-d('p(uid«i1 


(' 


hist( 


ririR 


of amino 


acids into 3 and 4 


clas 


ses. 














Amino Acid 


A 


V 


c 


D 


E 


F 


G 


H 


I 


W 


K 


L 


M 


N 


Y 


P 


Q 


R 


S 


T 


X 


c, 3-class 


1 


2 








1 


2 








2 


2 


1 


2 


2 





2 





1 


1 


1 


1 


2 


e, 3-class 





2 





1 


1 


2 





1 


2 


2 


1 








1 


2 





1 





1 


1 


1 


h, 3-class 





2 





1 


1 


2 


1 


1 


2 


2 


1 


2 


2 


1 


2 


1 


1 


1 


1 


1 





c, 4-class 


1 


2 


2 





1 


2 


3 


1 


2 


2 


1 


2 


2 





2 





1 


1 


1 


1 


2 


e, 4-class 





1 


2 


3 


3 


1 


2 





1 





3 


2 


2 


3 


1 


2 


3 








3 


2 


h, 4-class 


1 


2 


1 


3 





2 


3 


3 


2 


2 





2 


2 


3 


1 


3 


3 





3 





1 



Table 6. Accuracy of secondary structure predictions for different models. Here, VI and BW stand for 
'Viterbi' and 'Baum- Welch' algorithms, respectively. For all models 68-k, conformation-dependent reduc- 
tions are used except for model 68-81x21* where the single conformation-independent reduction is used. 



Model 


Test sot 


CC 


Sp 


ce 
'-'n 


Sp 


ch 


ch 
Op 




19-21, VI 


2 


48.69 


62.92 


7.94 


64.09 


84.17 


48.07 


53.24 


19-21, BW 


2 


70.67 


62.27 


30.74 


54.55 


66.12 


60.11 


60.45 


19-27, VI 


2 


62.50 


63.87 


56.02 


46.58 


55.82 


61.48 


58.63 


19-64, VI 


2 


64.10 


65.31 


50.11 


47.59 


62.45 


63.06 


60.50 


68-81x21*, VI 


2 


60.12 


69.16 


44.54 


54.89 


75.48 


60.26 


62.53 


68-81x21, VI 


2 


62.01 


68.21 


44.26 


57.79 


76.51 


61.98 


63.64 


68-81x21, BW 


2 


74.51 


61.71 


47.18 


59.99 


61.52 


68.93 


63.83 


68-81x21, VI 


1 


70.92 


66.51 


54.27 


53.45 


62.00 


69.07 


64.46 


68-81x21, BW 


1 


71.01 


68.98 


52.97 


57.58 


67.37 


66.69 


66.00 


68-256x21, VI 


2 


65.35 


68.41 


54.73 


58.41 


70.11 


64.58 


64.86 


68-256x21, BW 


2 


70.90 


68.15 


57.60 


60.86 


68.85 


69.84 


67.30 


68-256x21, VI 


1 


68.34 


69.97 


55.34 


57.68 


70.12 


66.23 


66.18 


(i8-2.")()x21. BW 


1 


7:-!. 22 


(i!).77 


.")7.4() 


59. (il 


(i7.2() 


7().:-!8 


(i7.<)() 


G0R4 


1 


79.3 


66.1 


54.7 


55.3 


63.3 


68.5 


66.2 


SSP 


1 


59.2 


52.8 


69.0 


55.3 


67.0 


68.1 


60.0 


G0R4 


2 


81.9 


62.0 


43.0 


54.6 


67.1 


64.3 


63.4 


SSP 


2 


74.7 


58.8 


45.7 


55.6 


66.3 


63.3 


61.4 
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